语音编码有助于以最小的失真方式传播语音在低频带宽度网络上的传播。基于神经网络的语音编解码器最近表现出与传统方法相对于传统方法的显着改善。尽管这一新一代的编解码器能够综合高保真语音,但它们对经常性或卷积层的使用通常会限制其有效的接受场,从而阻止他们有效地压缩语音。我们建议通过使用经过预定的变压器进一步降低神经语音编解码器的比特率,该变压器能够由于其电感偏置而在输入信号中利用长距离依赖性。因此,我们与卷积编码器同时使用了经过验证的变压器,该卷积编码器是通过量化器和生成的对抗性净解码器进行训练的端到端。我们的数值实验表明,补充神经语音编解码器的卷积编码器,用变压器语音嵌入嵌入的语音编解码器,比特率为$ 600 \,\ m athrm {bps} $,在合成的语音质量中均超过原始的神经言语编解码器,当相同的比特率。主观的人类评估表明,所得编解码器的质量比运行率的三到四倍的传统编解码器的质量可比或更好。
translated by 谷歌翻译
最近的工作[4]分析了两次可差化函数的最佳解决方案附近的亚当局部融合。结果发现,学习率必须足够小,以确保最佳解决方案的局部稳定性。以上的收敛结果也适用于Adamw。在这项工作中,我们提出了一种新的自适应优化方法,通过在两个方面扩展Adamw,以便放宽对局部稳定性的小型学习率的要求,我们称之为AIDA。首先,我们考虑跟踪梯度大小的第2矩R_T。当p = 2时,r_t减少到adamw的v_t。假设{m_t}是adamw的第一个时刻。众所周知,更新方向M_ {T + 1} /(v_ {t + 1} + epsilon)^ 0.5(或m_ {t + 1} /(v_ {t + 1} ^ 0.5 + epsilon)的Adamw(或者亚当)可以被分解为标志向量符号(M_ {t + 1})通过幅度的向量乘以量数| m_ {t + 1} | /(v_ {t + 1} + epsilon)^ 0.5(或| m_ {t + 1} | /(v_ {t + 1} ^ 0.5 + epsilon)。Aida旨在以| m_ {t + 1} | ^ q /(r_ {t + 1} + epsilon)^(q / p)(或| m_ {t + 1} | ^ q /((r_ {t + 1})^(q / p)+ epsilon),减少到当(p,q)=(2,1)时的adamw。假设原点0是两次可差化函数的本地最佳解决方案。理论上是在理论上发现的,当Q> 1和P> 1在Aida中,原点0只有当重量衰减是非零时局部稳定。进行实验,用于解决十个玩具优化问题和训练变压器和训练变压器和Swin变压器,为两个深度学习(DL)任务。实证研究表明,在许多场景中表明(包括两个DL任务),AIDA具有特定设置(P,Q)不等于(2,1)优于Adamw的设置(p,q)=(2,1)。
translated by 谷歌翻译
The problem of domain generalization is to take knowledge acquired from a number of related domains where training data is available, and to then successfully apply it to previously unseen domains. We propose a new feature learning algorithm, Multi-Task Autoencoder (MTAE), that provides good generalization performance for crossdomain object recognition.Our algorithm extends the standard denoising autoencoder framework by substituting artificially induced corruption with naturally occurring inter-domain variability in the appearance of objects. Instead of reconstructing images from noisy versions, MTAE learns to transform the original image into analogs in multiple related domains. It thereby learns features that are robust to variations across domains. The learnt features are then used as inputs to a classifier.We evaluated the performance of the algorithm on benchmark image recognition datasets, where the task is to learn features from multiple datasets and to then predict the image label from unseen datasets. We found that (denoising) MTAE outperforms alternative autoencoder-based models as well as the current state-of-the-art algorithms for domain generalization.
translated by 谷歌翻译
Making histopathology image classifiers robust to a wide range of real-world variability is a challenging task. Here, we describe a candidate deep learning solution for the Mitosis Domain Generalization Challenge 2022 (MIDOG) to address the problem of generalization for mitosis detection in images of hematoxylin-eosin-stained histology slides under high variability (scanner, tissue type and species variability). Our approach consists in training a rotation-invariant deep learning model using aggressive data augmentation with a training set enriched with hard negative examples and automatically selected negative examples from the unlabeled part of the challenge dataset. To optimize the performance of our models, we investigated a hard negative mining regime search procedure that lead us to train our best model using a subset of image patches representing 19.6% of our training partition of the challenge dataset. Our candidate model ensemble achieved a F1-score of .697 on the final test set after automated evaluation on the challenge platform, achieving the third best overall score in the MIDOG 2022 Challenge.
translated by 谷歌翻译
Supervised Question Answering systems (QA systems) rely on domain-specific human-labeled data for training. Unsupervised QA systems generate their own question-answer training pairs, typically using secondary knowledge sources to achieve this outcome. Our approach (called PIE-QG) uses Open Information Extraction (OpenIE) to generate synthetic training questions from paraphrased passages and uses the question-answer pairs as training data for a language model for a state-of-the-art QA system based on BERT. Triples in the form of <subject, predicate, object> are extracted from each passage, and questions are formed with subjects (or objects) and predicates while objects (or subjects) are considered as answers. Experimenting on five extractive QA datasets demonstrates that our technique achieves on-par performance with existing state-of-the-art QA systems with the benefit of being trained on an order of magnitude fewer documents and without any recourse to external reference data sources.
translated by 谷歌翻译
While the capabilities of autonomous systems have been steadily improving in recent years, these systems still struggle to rapidly explore previously unknown environments without the aid of GPS-assisted navigation. The DARPA Subterranean (SubT) Challenge aimed to fast track the development of autonomous exploration systems by evaluating their performance in real-world underground search-and-rescue scenarios. Subterranean environments present a plethora of challenges for robotic systems, such as limited communications, complex topology, visually-degraded sensing, and harsh terrain. The presented solution enables long-term autonomy with minimal human supervision by combining a powerful and independent single-agent autonomy stack, with higher level mission management operating over a flexible mesh network. The autonomy suite deployed on quadruped and wheeled robots was fully independent, freeing the human supervision to loosely supervise the mission and make high-impact strategic decisions. We also discuss lessons learned from fielding our system at the SubT Final Event, relating to vehicle versatility, system adaptability, and re-configurable communications.
translated by 谷歌翻译
While the brain connectivity network can inform the understanding and diagnosis of developmental dyslexia, its cause-effect relationships have not yet enough been examined. Employing electroencephalography signals and band-limited white noise stimulus at 4.8 Hz (prosodic-syllabic frequency), we measure the phase Granger causalities among channels to identify differences between dyslexic learners and controls, thereby proposing a method to calculate directional connectivity. As causal relationships run in both directions, we explore three scenarios, namely channels' activity as sources, as sinks, and in total. Our proposed method can be used for both classification and exploratory analysis. In all scenarios, we find confirmation of the established right-lateralized Theta sampling network anomaly, in line with the temporal sampling framework's assumption of oscillatory differences in the Theta and Gamma bands. Further, we show that this anomaly primarily occurs in the causal relationships of channels acting as sinks, where it is significantly more pronounced than when only total activity is observed. In the sink scenario, our classifier obtains 0.84 and 0.88 accuracy and 0.87 and 0.93 AUC for the Theta and Gamma bands, respectively.
translated by 谷歌翻译
Vocal Bursts -- short, non-speech vocalizations that convey emotions, such as laughter, cries, sighs, moans, and groans -- are an often-overlooked aspect of speech emotion recognition, but an important aspect of human vocal communication. One barrier to study of these interesting vocalizations is a lack of large datasets. I am pleased to introduce the EmoGator dataset, which consists of 32,040 samples from 365 speakers, 16.91 hours of audio; each sample classified into one of 30 distinct emotion categories by the speaker. Several different approaches to construct classifiers to identify emotion categories will be discussed, and directions for future research will be suggested. Data set is available for download from https://github.com/fredbuhl/EmoGator.
translated by 谷歌翻译
Charisma is considered as one's ability to attract and potentially also influence others. Clearly, there can be considerable interest from an artificial intelligence's (AI) perspective to provide it with such skill. Beyond, a plethora of use cases opens up for computational measurement of human charisma, such as for tutoring humans in the acquisition of charisma, mediating human-to-human conversation, or identifying charismatic individuals in big social data. A number of models exist that base charisma on various dimensions, often following the idea that charisma is given if someone could and would help others. Examples include influence (could help) and affability (would help) in scientific studies or power (could help), presence, and warmth (both would help) as a popular concept. Modelling high levels in these dimensions for humanoid robots or virtual agents, seems accomplishable. Beyond, also automatic measurement appears quite feasible with the recent advances in the related fields of Affective Computing and Social Signal Processing. Here, we, thereforem present a blueprint for building machines that can appear charismatic, but also analyse the charisma of others. To this end, we first provide the psychological perspective including different models of charisma and behavioural cues of it. We then switch to conversational charisma in spoken language as an exemplary modality that is essential for human-human and human-computer conversations. The computational perspective then deals with the recognition and generation of charismatic behaviour by AI. This includes an overview of the state of play in the field and the aforementioned blueprint. We then name exemplary use cases of computational charismatic skills before switching to ethical aspects and concluding this overview and perspective on building charisma-enabled AI.
translated by 谷歌翻译
Research has shown that climate change creates warmer temperatures and drier conditions, leading to longer wildfire seasons and increased wildfire risks in the United States. These factors have in turn led to increases in the frequency, extent, and severity of wildfires in recent years. Given the danger posed by wildland fires to people, property, wildlife, and the environment, there is an urgency to provide tools for effective wildfire management. Early detection of wildfires is essential to minimizing potentially catastrophic destruction. In this paper, we present our work on integrating multiple data sources in SmokeyNet, a deep learning model using spatio-temporal information to detect smoke from wildland fires. Camera image data is integrated with weather sensor measurements and processed by SmokeyNet to create a multimodal wildland fire smoke detection system. We present our results comparing performance in terms of both accuracy and time-to-detection for multimodal data vs. a single data source. With a time-to-detection of only a few minutes, SmokeyNet can serve as an automated early notification system, providing a useful tool in the fight against destructive wildfires.
translated by 谷歌翻译